By the end of this article, I will show how to glean competitive intelligence from Eat24's website with public data and find a good guess at the number of restaurants they may have had (or at least the number of restaurant ids they've issued).
This article covers using Bayesian statistics to identify likely hypotheses in the face of extreme uncertainty.
How do we know about the unknown? How can we take the small amount we learn and know more?
Here's an interesting example: Let's say a friend has 5 different types of dice: 4-sided, 6-sided, 8-sided, 12-sided, and 20-sided. Your friend chooses one die without you seeing (and will stick to it throughout this example). Without knowing anything else... how likely is each die?
In order to answer this question I constructed the following class which is a simplification of Allen Downey's own Pmf class which he uses in the book "Think Bayes". I will cover each method but for the time being let's just make sure we understand the constructor and the update method.
In [6]:
class Pmf():
def __init__(self, hypotheses):
self._hypotheses = hypotheses
number_of_hypotheses = len(self._hypotheses)
uninformative_prior_distribution = [1/(1.0*len(self._hypotheses))]*number_of_hypotheses
self._likelihoods = np.array(uninformative_prior_distribution)
def update(self, data):
for i in range(0, len(self._hypotheses)):
hypothesis = self._hypotheses[i]
if hypothesis < data:
self._likelihoods[i] = 0.0
else:
self._likelihoods[i] = self._likelihoods[i] * 1/(1.0*hypothesis)
self._likelihoods = self._likelihoods/sum(self._likelihoods)
def likely_interval(self):
p_value = .05
total_likelihood = 0.0
index = -1
while (1-total_likelihood) > p_value:
index += 1
total_likelihood += self._likelihoods[index]
left_hypothesis_index = 0
for i in self._hypotheses:
left_hypothesis_index = i
if self._likelihoods[i] > 0.0:
break
return (left_hypothesis_index, self._hypotheses[index])
def likely_average(self):
average = 0
for i in range(0, len(self._hypotheses)):
average += self._hypotheses[i]*self._likelihoods[i]
return average
Starting with the constructor we call it like so:
In [ ]:
pmf = Pmf(hypotheses=[4, 6, 8, 12, 20])
The hypotheses parameter is set to a list of all possible hypotheses for which die our friend is rolling. If we wanted to add in a 30-sided die we could just by adding ', 30' to the end of that list. For now, let's keep with what we have.
The constructor then takes that set of hypotheses and constructs an uninformative prior likelihood for each hypothesis. An uninformative prior is always just $\frac{1}{number of hypotheses}$. In our example then, each of our 5 hypotheses has a $\frac{1}{5}$ likelihood given no other information. The below code is an example which illustrates this in code and with a bar chart:
In [13]:
import matplotlib.pyplot as plt
pmf = Pmf(hypotheses=[4, 6, 8, 12, 20])
# Graph our initial beliefs (prior to updating anything)
plt.figure(figsize = (8,8))
plt.ylim([0,1])
plt.xlim([4,21])
plt.xlabel('N-sided die hypothesis')
plt.ylabel('Likelihood of hypothesis')
plt.title('Likelihood distribution of unknown die prior to rolling')
plt.bar(pmf._hypotheses, pmf._likelihoods)
print 'Likelihood of 4 sided: ' + str(pmf._likelihoods[0])
print 'Likelihood of 6 sided: ' + str(pmf._likelihoods[1])
print 'Likelihood of 8 sided: ' + str(pmf._likelihoods[2])
print 'Likelihood of 12 sided: ' + str(pmf._likelihoods[3])
print 'Likelihood of 20 sided: ' + str(pmf._likelihoods[4])
Each die has the same likelihood... 20%.
Now let's update our beliefs with one more piece of information. Now let's say your friend rolls the die (with you blindfolded) and tells you they rolled a 5. Now what do your beliefs look like?
I've constructed the update function to take the value of the number that gets rolled based on two ideas. If you look at lines 4 & 5 you can see that if the number I roll is larger than my hypothesis I can completely discount that hypothesis. To be fair we could also take into account the likelihood of my friend lying but let's just assume my friend is 100% honorable.
So first we try to discount the hypothesis, and then if we can't do that then we update our likelihood by multiplying our current likelihood by the naive likelihood that this hypothesis is correct. In the case of the dice the probability that I roll some number on a die that has at least that many sides N is $\frac{1}{N}$. Here's a reminder of what the update method looks like:
In [ ]:
def update(self, data):
for i in range(0, len(self._hypotheses)):
hypothesis = self._hypotheses[i]
if hypothesis < data:
self._likelihoods[i] = 0.0
else:
self._likelihoods[i] = self._likelihoods[i] * 1/(1.0*hypothesis)
self._likelihoods = self._likelihoods/sum(self._likelihoods)
Thus if our prior belief was $\frac15$ then our updated belief will be $\frac15 * \frac1N$. If you recall, Bayes law says $P(A | B) = \frac{P(B|A)P(A)}{P(B)}$. We haven't divided by anything yet though. To account for this, when we're done updating all of our likelihoods we normalize them so that they all add up to zero. That's what the last line in the update method is doing for us.
That's the basics of the code... and now we can continue with our example and see what happens after we roll a 5.
In [14]:
pmf.update(5) # Roll the mystery die and see what we get
plt.ylim([0,1])
plt.xlim([4,21])
plt.xlabel('N-sided die hypothesis')
plt.ylabel('Likelihood of hypothesis')
plt.title('Likelihood distribution of unknown die after rolling once')
plt.bar(pmf._hypotheses, pmf._likelihoods)
print 'Likelihood of 4 sided: ' + str(pmf._likelihoods[0])
print 'Likelihood of 6 sided: ' + str(pmf._likelihoods[1])
print 'Likelihood of 8 sided: ' + str(pmf._likelihoods[2])
print 'Likelihood of 12 sided: ' + str(pmf._likelihoods[3])
print 'Likelihood of 20 sided: ' + str(pmf._likelihoods[4])
Notice how we've eliminated the four-sided die entirely? Our most likely candidate now is a 6 sided die with the 20 sided die being the least likely option.
Now let's pretend our friend rolls several times.
In [15]:
pmf.update(8) # Roll the mystery die and see what we get
pmf.update(3)
pmf.update(7)
plt.ylim([0,1])
plt.xlim([4,21])
plt.xlabel('N-sided die hypothesis')
plt.ylabel('Likelihood of hypothesis')
plt.title('Likelihood distribution of unknown die after rolling three more times')
plt.bar(pmf._hypotheses, pmf._likelihoods)
print 'Likelihood of 4 sided: ' + str(pmf._likelihoods[0])
print 'Likelihood of 6 sided: ' + str(pmf._likelihoods[1])
print 'Likelihood of 8 sided: ' + str(pmf._likelihoods[2])
print 'Likelihood of 12 sided: ' + str(pmf._likelihoods[3])
print 'Likelihood of 20 sided: ' + str(pmf._likelihoods[4])
Since a 7 and 8 were rolled we can entirely discount the 6-sided die. This has resulted in an 8-sided die being ~80% likely. It's still quite possible that our friend is rolling a 12-sided die (though unlikely) and pretty improbably that they're rolling a 20 sided die.
So now... what if we extend this example to competitive intelligence? Imagine you want to know how many restaurant ids Eat24 has. What if we randomly sampled the restaurants on their web site and treated the restaurant ids in the URL as "dice rolls"? Of course this requires that we assume the ids are sequential and only created when Eat24 actually has a restaurant.
Also we need to decide on a somewhat reasonable upper bound for the number of restaurants they might have. Knowing very little about this stuff I chose 150,000. That means I have 150k hypotheses to test. In our previous example this would have been like saying my friend had 150k different dice to choose from each with N sides (where N is a number between 1 and 150k). Here's the code to analyze this scenario:
In [18]:
plt.figure(figsize=(8,8))
# Create my Bayesian modeling object
eat_24 = Pmf(range(0, 150000))
# Update beliefs with observed restaurant ids
eat_24.update(4667) # Chicago, IL
eat_24.update(17627) # NY, New York
eat_24.update(20180) # Southern California
eat_24.update(8259) # Boston, MA
plt.xlabel('# of possible restaurants')
plt.ylabel('% likelihood given estimate is correct (where .01 = 1%)')
plt.title('Likelihood distribution for number of restaurants Eat24 supports')
plt.bar(eat_24._hypotheses, eat_24._likelihoods)
print "95% chance of max restaurant id to be between: " + str(eat_24.likely_interval())
print "Our most likely center hypothesis: " + str(eat_24.likely_average())
This means that we'd expect the number of restaurant ids to be somewhere around 29/30k and that it's reasonable (given the above info) that the number might fall between 20k and 54k.
When I fact checked this with Tom and Lev, it turns out that the maximum id we have for Eat24 is 34,803. Not too shabby given only 4 data points! :)
As far as active restaurants go, since these are just ids we have no idea what gaps might exist due to attrited restaurants or just oddness in how they issue the ids.
But there you have it! Bayes in action.
NOTE: This will become a blog post but with the GrubHub/Eat24 specifics replaced with something a bit more general public knowledge. Don't share this outside of GrubHub. Inside GrubHub feel free as an example of Bayesian statistics in action.